Module overview

  • Week 9. Describing Relationships
    • Correlation (calculation, interpretation)
    • Regression (model structure, model fitting)
    • What/when/why/how
  • Week 10. Simple Linear Regression
    • Can we use the model? (assumptions, hypothesis testing)
    • How good is the model? (interpretation, model fit)
  • Week 11. Multiple Linear Regression
    • Multiple Linear Regression (MLR) modelling
    • Assumptions, interpretation and the principle of parsimony
  • Week 12. Nonlinear Regression
    • Common nonlinear functions
    • Transformations

Regressions

Simple linear regression

Y_i = \beta_0 + \beta_1 x_i + \epsilon_i

Ideal for predicting a continuous response variable from a single predictor variable: “How does y change as x changes, when the relationship is linear?”

Multiple linear regression

Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_k x_{ki} + \epsilon_i

“How does y change as x_1, x_2, …, x_k change?”

Nonlinear regression

Y_i = f(x_i, \beta) + \epsilon_i

where f(x_i, \beta) is a nonlinear function of the parameters \beta: “How do we model a change in y with x when the relationship is nonlinear?”

Nonlinear regression

The Gauss–Newton approach, named after Carl Friedrich Gauss (1777–1855) and Isaac Newton (1642–1726), is the most commonly used method for fitting nonlinear regression models.

Fitting a nonlinear model

Linear relationships are simple to interpret since the rate of change is constant.

“As one changes, the other changes at a constant rate.”

Nonlinear relationships often involve exponential, logarithmic, or power functions.

“As one changes, the other changes at a rate that is not proportional to the change in the other.”

Nonlinear relationships: exponents

  • x^2 is the square of x.
  • x^3 is the cube of x.
  • x^a is x raised to the power of a.

In a relationship where y is a function of x^a, as x increases, y changes at a rate determined by the power a: the larger the power, the faster y grows.

Code
# Plot a simulation of above in ggplot2
set.seed(123)
tibble(x = seq(0, 10, by = 0.2), y = x^2) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  labs(x = "x", y = "y") +
  ggtitle(expression(y == x^2)) +
  theme(plot.title = element_text(size = 40, face = "bold"))

Nonlinear relationships: logarithms

  • log_e(x) is the natural logarithm of x.
  • log_{10}(x) is the common logarithm of x.
  • log_a(x) is the logarithm of x to the base a.

Interpretation:

  • If \log_a(y) = x: as x increases, y increases exponentially, since y = a^x.
  • If y = \log_a(x): as x increases, y also increases, but ever more slowly, since x = a^y.

Exponents and logarithms

Exponents vs. logarithms

  • Definition: If a^n = b, a is the base, n is the exponent and b is the result. If \log_a b = n, a is the base, b is the result and n is the logarithm (the exponent in the equivalent exponential form).
  • Example: 2^3 = 8 versus \log_2 8 = 3.
  • Interpretation: 2 raised to the power of 3 equals 8; the power to which you must raise 2 to get 8 is 3.
  • Inverse: exponentiation and the logarithm are inverse operations of each other.
  • Properties of exponents: (a^n)^m = a^{n \cdot m}, a^n \cdot a^m = a^{n+m}, \frac{a^n}{a^m} = a^{n-m}.
  • Properties of logarithms: \log_a(b \cdot c) = \log_a b + \log_a c, \log_a\left(\frac{b}{c}\right) = \log_a b - \log_a c, \log_a(b^n) = n \cdot \log_a b.
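These identities can be checked numerically in R (a quick sanity check, not part of the lecture code; `log(x, base = a)` is R's \log_a(x)):

```r
# Numerically verify the exponent and logarithm rules above
a <- 2; n <- 3; m <- 4; b <- 8; c <- 5

# Exponent rules
stopifnot(all.equal((a^n)^m, a^(n * m)))    # power of a power
stopifnot(all.equal(a^n * a^m, a^(n + m)))  # product rule
stopifnot(all.equal(a^n / a^m, a^(n - m)))  # quotient rule

# Logarithm rules
stopifnot(all.equal(log(b * c, base = a), log(b, base = a) + log(c, base = a)))
stopifnot(all.equal(log(b / c, base = a), log(b, base = a) - log(c, base = a)))
stopifnot(all.equal(log(b^n, base = a), n * log(b, base = a)))

# The worked example: log_2(8) = 3 because 2^3 = 8
stopifnot(all.equal(log(8, base = 2), 3))
```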

Note

This is for your understanding; you don’t need to memorise it.

Dealing with nonlinearity

Transformations

Often, a nonlinear relationship may be transformed into a linear relationship by applying a transformation to the response variable or the predictor variable(s).

  • Logarithmic: y = \log(x)
  • Exponential: y = e^x
  • Square-root: y = \sqrt{x}
  • Inverse: y = \frac{1}{x}
  • These transformations work well when y changes monotonically with x.
  • What if the relationship is not monotonic, or is more complex?
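As a quick sketch of why such transformations help: if y grows exponentially in x, then log(y) is linear in x, so lm() can recover the underlying parameters. This is a simulated illustration (the relationship y = 2e^{0.3x} is made up for the example):

```r
set.seed(1)
# Simulate an exponential relationship with multiplicative noise
x <- seq(1, 10, by = 0.5)
y <- 2 * exp(0.3 * x) * exp(rnorm(length(x), sd = 0.05))

# y vs x is nonlinear, but log(y) = log(2) + 0.3x + error is linear
fit <- lm(log(y) ~ x)
coef(fit)  # intercept close to log(2), slope close to 0.3
```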

Common nonlinear functions

f(x_i, \beta)

Exponential decay relationship

The response variable decreases and approaches a limit as the predictor variable increases.

y = a \cdot e^{-bx}

Code
set.seed(429) # set seed
# Simulate data:
decay <- tibble(
  predictor = seq(0,10, by = 0.2),
  response = abs(exp(-0.5*predictor) + rnorm(length(predictor), mean = 1, sd = 0.1)))

ggplot(data = decay, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Examples: radioactive decay, population decline, chemical reactions.

Asymptotic relationship

Response variable increases and approaches a limit as the predictor variable increases.

y = a + b(1 - e^{-cx})

Code
set.seed(442) # set seed
# Simulate data:
asymptotic = tibble(
  predictor = seq(0, 10, by = 0.2),
  response = 100*(1-exp(-0.5*predictor)) + rnorm(length(predictor), mean = 0, sd = 10))

ggplot(data = asymptotic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Examples: population growth, enzyme kinetics.

Logistic relationship

An S-shaped relationship, where the response variable is at first exponential, then asymptotic.

y = c + \frac{d-c}{1+e^{-b(x-a)}}

Code
set.seed(450)
# Simulate data:
logistic <- tibble(predictor = seq(0, 10, by = 0.2), 
  response = 10 + abs(300 * (1 / (1 + exp(-0.8 * (predictor - 5)))) + rnorm(length(predictor), mean = 0, sd = 10)))

ggplot(data = logistic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Examples: growth of bacteria, disease spread, species growth.

Polynomial relationship

Response variable changes in a variety of ways as the predictor variable changes. Also known as ‘curvilinear’.

y = a + bx + cx^2 + dx^3 + ...

Code
# Set seed for reproducibility
set.seed(529)
# Simulate data:
curvilinear <- tibble(predictor = seq(0, 30, length.out = 50), 
  response = 50 * (1 - (predictor - 15)^2 / 225) + rnorm(length(predictor), mean = 0, sd = 5))

ggplot(data = curvilinear, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Examples: food intake, drug dosage, exercise.

Transformations

How far can we go?

Transformations: exponential decay

Before transformation

Code
ggplot(data = decay,
       aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
ggplot(data = decay, 
       aes(x = predictor, y = log(response))) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Transformations: exponential decay

Before transformation

Code
autoplot(lm(response ~ predictor, data = decay)) +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
autoplot(lm(log(response) ~ predictor, data = decay)) +
  labs(x = "Predictor", y = "Response")

Transformations: asymptotic relationship

Before transformation

Code
ggplot(data = asymptotic,
       aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
ggplot(data = asymptotic, 
       aes(x = log(predictor), y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Transformations: asymptotic relationship

Before transformation

Code
autoplot(lm(response ~ predictor, data = asymptotic)) +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
autoplot(lm(log(response) ~ predictor, data = asymptotic)) +
  labs(x = "Predictor", y = "Response")
Warning in log(response): NaNs produced

Transformations: logistic relationship

Before transformation

Code
ggplot(data = logistic,
       aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
ggplot(data = logistic, 
       aes(x = predictor, y = log(response))) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Transformations: logistic relationship

Before transformation

Code
autoplot(lm(response ~ predictor, data = logistic)) +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
autoplot(lm(log(response) ~ predictor, data = logistic)) +
  labs(x = "Predictor", y = "Response")

Transformations: polynomial relationship

Before transformation

Code
ggplot(data = curvilinear,
       aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
ggplot(data = curvilinear, 
       aes(x = predictor, y = log(response))) +
  geom_point() +
  labs(x = "Predictor", y = "Response")
Warning in log(response): NaNs produced
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Transformations: polynomial relationship

Before transformation

Code
autoplot(lm(response ~ predictor, data = curvilinear)) +
  labs(x = "Predictor", y = "Response")

After log_e transform

Code
autoplot(lm(log(response) ~ predictor, data = curvilinear)) +
  labs(x = "Predictor", y = "Response")
Warning in log(response): NaNs produced

Did the transformations work?

  • To a certain extent…
  • Problems:
    • The transformed relationships typically still fail the linearity assumption, even when the other assumptions look “ok”.
    • Poor fit to the data (over or underfitting in some areas).
    • Difficult to interpret the results.

Nonlinear regression

  • A way to model complex (nonlinear) relationships.
    • i.e. phenomena that arise in the natural and physical sciences e.g. biology, chemistry, physics, engineering.
  • At least one predictor is not linearly related to the response variable.

Performing nonlinear regression

  • Polynomial regression: still linear in the parameters and a good place to start.
  • Nonlinear regression: use the nls() function to fit the following nonlinear models:
    • Exponential growth
    • Exponential decay
    • Logistic

Polynomial regression

A special case of multiple linear regression used to model nonlinear relationships.

Model

Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_k x_i^k + \epsilon_i

where k is the degree of the polynomial.

  • The model is still linear in the parameters \beta and can be fitted using least squares.
  • Instead of multiple predictors, we have multiple terms of the same predictor (same x).
  • Can still be fit using lm().
  • The more complex, the less likely it follows a true biological relationship…
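One detail worth knowing: poly() uses orthogonal polynomials by default, while poly(x, k, raw = TRUE) fits the literal x, x^2, … terms; the coefficients differ, but the fitted curve is identical. A small check on simulated data (not the lecture's dataset):

```r
set.seed(1)
# Simulated quadratic relationship: y = 1 + 2x - 0.3x^2 + noise
x <- runif(40, 0, 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(40)

fit_orth <- lm(y ~ poly(x, 2))              # orthogonal polynomials (default)
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE))  # raw powers: x and x^2

# Different coefficients, identical fitted values
stopifnot(all.equal(fitted(fit_orth), fitted(fit_raw)))

# The raw fit recovers coefficients near (1, 2, -0.3)
round(unname(coef(fit_raw)), 2)
```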


Adding polynomial terms

  • Linear: y = \beta_0 + \beta_1 x
  • Quadratic: y = \beta_0 + \beta_1 x + \beta_2 x^2
  • Cubic: y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3
  • Each level increases the power of the predictor by 1.

Polynomial fitting

Using the asymptotic data

The data

See Slide 11 for the relationship and mathematical expression.

Code
ggplot(asymptotic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response")

Fitting the model (linear)

Y_i = \beta_0 + \beta_1 x_i + \epsilon_i

Code
lin_fit <- lm(response ~ predictor, asymptotic)
Code
ggplot(asymptotic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response") +
  geom_line(aes(y = predict(lin_fit)), color = "red", size = 2)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Fitting the model (poly(degree = 2))

Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i

Code
poly2_fit <- lm(response ~ poly(predictor, 2), asymptotic)
Code
ggplot(asymptotic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response") +
  geom_line(aes(y = predict(lin_fit)), color = "red") +
  geom_line(aes(y = predict(poly2_fit)), color = "slateblue", size = 2)

Fitting the model (poly(degree = 3))

Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i

Code
poly3_fit <- lm(response ~ poly(predictor, 3), asymptotic)
Code
ggplot(asymptotic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response") +
  geom_line(aes(y = predict(lin_fit)), color = "red") +
  geom_line(aes(y = predict(poly2_fit)), color = "slateblue") +
  geom_line(aes(y = predict(poly3_fit)), color = "seagreen", size = 2)

Fitting the model (poly(degree = 10))

Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + ... + \beta_{10} x_i^{10} + \epsilon_i

Code
poly10_fit <- lm(response ~ poly(predictor, 10), asymptotic)
Code
ggplot(asymptotic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response") +
  geom_line(aes(y = predict(lin_fit)), color = "red") +
  geom_line(aes(y = predict(poly2_fit)), color = "slateblue") +
  geom_line(aes(y = predict(poly3_fit)), color = "seagreen") +
  geom_line(aes(y = predict(poly10_fit)), color = "firebrick", size = 2)
Code
# 1) Compute R^2 values
R2_lin    <- summary(lin_fit)$r.squared
R2_poly2  <- summary(poly2_fit)$adj.r.squared
R2_poly3  <- summary(poly3_fit)$adj.r.squared
R2_poly10 <- summary(poly10_fit)$adj.r.squared

# 2) Create a simple table (data frame)
model_r2_table <- data.frame(
  Model = c("Linear", "Poly2", "Poly3", "Poly10"),
  R2    = c(R2_lin, R2_poly2, R2_poly3, R2_poly10)
)

# Display the table
knitr::kable(
  model_r2_table, 
  digits  = 3, 
  caption = "Comparison of R^2^ of Polynomial Models"
)
Comparison of R2 of Polynomial Models
Model R2
Linear 0.570
Poly2 0.820
Poly3 0.872
Poly10 0.862

Note

We use adjusted R^2 for polynomials: extra terms mean extra complexity, so an extra penalty.

Limitations

  • Meaning of the coefficients is not always clear.
  • Extrapolation can be dangerous.
  • Extra terms can lead to overfitting and are difficult to interpret:
  • Parsimony: is the most complex term (highest power) significant? If not, remove.
Code
summary(poly10_fit)

Call:
lm(formula = response ~ poly(predictor, 10), data = asymptotic)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1659  -8.6908  -0.0494   8.8003  16.4012 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             79.818      1.552  51.426  < 2e-16 ***
poly(predictor, 10)1   159.368     11.084  14.378  < 2e-16 ***
poly(predictor, 10)2  -106.939     11.084  -9.648 5.37e-12 ***
poly(predictor, 10)3    48.570     11.084   4.382 8.28e-05 ***
poly(predictor, 10)4   -19.411     11.084  -1.751   0.0876 .  
poly(predictor, 10)5     1.193     11.084   0.108   0.9148    
poly(predictor, 10)6    -2.769     11.084  -0.250   0.8040    
poly(predictor, 10)7    -1.343     11.084  -0.121   0.9042    
poly(predictor, 10)8    -4.009     11.084  -0.362   0.7195    
poly(predictor, 10)9    -2.851     11.084  -0.257   0.7984    
poly(predictor, 10)10    5.769     11.084   0.520   0.6056    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.08 on 40 degrees of freedom
Multiple R-squared:  0.8897,    Adjusted R-squared:  0.8621 
F-statistic: 32.26 on 10 and 40 DF,  p-value: 4.846e-16

Still:

  • Easy to fit: just add polynomial terms to the model.
  • Simple to perform: use lm().

Nonlinear fitting

Fitting a nonlinear model

If you have some understanding of the underlying relationship (e.g. mechanistic process) between the variables, you can fit a nonlinear model.

Mathematical expression

Y_i = f(x_i, \beta) + \epsilon_i

where f(x_i, \beta) is a nonlinear function of the parameters \beta.

  • Y_i is the continuous response variable.
  • x_i is the vector of predictor variables.
  • \beta is the vector of unknown parameters.
  • \epsilon_i is the random error term (residual error).

Assumptions

Like the linear model, the nonlinear model assumes INE:

  • Error terms are independent (Independence).
  • Error terms are normally distributed (Normality).
  • Error terms have equal/constant variance (Homoscedasticity).

Basically:

\epsilon_i \sim N(0, \sigma^2)

Like all other models we have seen, we focus on the residuals to assess the model fit, since the residuals are the only part of the model that is random.

Estimating the model parameters

  • The parameters are estimated using the method of least squares.
  • For nonlinear models, a nonlinear optimisation algorithm (e.g. Gauss–Newton) is used to find the best fit, rather than ordinary least squares.
  • This can only be performed iteratively and depends on a “best guess” of the parameters as a start.
    • i.e. we need to provide starting values for the nonlinear least squares algorithm to begin from.
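To make the iterative idea concrete, here is a minimal sketch: treat least squares as minimising the residual sum of squares and hand that objective to a general-purpose optimiser, starting from a guess. (This is only an illustration of the principle; nls() itself uses the Gauss–Newton algorithm, not optim(). The data and the true values a = 3, b = 0.5 are simulated for this example.)

```r
set.seed(429)
# Simulated exponential-decay data: y = 3 * exp(-0.5 * x) + noise
predictor <- seq(0, 10, by = 0.2)
response  <- 3 * exp(-0.5 * predictor) + rnorm(length(predictor), sd = 0.1)

# Residual sum of squares for the model y = a * exp(-b * x)
rss <- function(par) {
  sum((response - par["a"] * exp(-par["b"] * predictor))^2)
}

# Iteratively minimise the RSS from a starting guess of the parameters
fit <- optim(par = c(a = 1, b = 1), fn = rss)
fit$par  # least-squares estimates, close to a = 3, b = 0.5
```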


Implementation

Use the nls() function in R.

Code
nls(formula, data, start)
  • formula: a formula object, with the response variable on the left of a ~ operator, and the predictor variable(s) on the right.
  • data: a data frame containing the variables in the model.
  • start: a named list of starting values for the parameters in the model.

Example: Fitting an asymptotic model

Finding starting values

y = a + b(1 - e^{-cx})

  • a is the value of y when x = 0.
  • b is the upper limit: the maximum value of y.
  • c is the rate of change: the rate at which y approaches the upper limit.
Code
ggplot(data = asymptotic, aes(x = predictor, y = response)) +
  geom_point() + 
  geom_hline(yintercept = 100, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  ## plot the rate
  geom_segment(aes(x = 0, y = 0, xend = 2.5, yend = 100), 
               arrow = arrow(length = unit(0.5, "cm")), 
               color = "red") +
  labs(x = "Predictor", y = "Response")
Warning in geom_segment(aes(x = 0, y = 0, xend = 2.5, yend = 100), arrow = arrow(length = unit(0.5, : All aesthetics have length 1, but the data has 51 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

First guess

y = a + b(1 - e^{-cx})

Code
fit_asymptotic <- nls(response ~ a + b*(1-exp(-c*predictor)), data = asymptotic, 
  start = list(a = 0, b = 100, c = 0.8))
Code
ggplot(data = asymptotic, aes(x = predictor, y = response)) +
  geom_point() + 
  geom_hline(yintercept = 100, linetype = "dashed") +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(x = "Predictor", y = "Response") +
  geom_line(aes(y = predict(fit_asymptotic)), color = "red", size = 2)

Check assumptions

Code
library(nlstools)
resids <- nlsResiduals(fit_asymptotic)
plot(resids)

Interpretation

Code
summary(fit_asymptotic)

Formula: response ~ a + b * (1 - exp(-c * predictor))

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a -14.51755    6.64162  -2.186   0.0337 *  
b 113.03797    6.44716  17.533  < 2e-16 ***
c   0.62962    0.07142   8.816 1.32e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.21 on 48 degrees of freedom

Number of iterations to convergence: 4 
Achieved convergence tolerance: 1.178e-06
  • All parameters are significant, since each p-value is less than 0.05.
  • If this were real data (e.g. population growth), the parameters themselves, e.g. the rate of change, would be useful in their own right.
  • The parameterised model is:

y = -14.5 + 113.04(1 - e^{-0.63x})

An R-squared value is not reported for nonlinear models, as the sum of squares is not partitioned into explained and unexplained components. Use the residual standard error and residual plots instead to compare models.
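For example, the residual standard error of a fit can be pulled out with summary(fit)$sigma and compared across candidate models. A self-contained sketch, re-simulating the asymptotic data from the earlier slides:

```r
set.seed(442)
# Re-simulate the asymptotic data used in the slides
dat <- data.frame(predictor = seq(0, 10, by = 0.2))
dat$response <- 100 * (1 - exp(-0.5 * dat$predictor)) +
  rnorm(nrow(dat), mean = 0, sd = 10)

# Nonlinear asymptotic model vs a straight-line model
fit_nls <- nls(response ~ a + b * (1 - exp(-c * predictor)), data = dat,
               start = list(a = 0, b = 100, c = 0.8))
fit_lin <- lm(response ~ predictor, data = dat)

# Residual standard errors, on the scale of the response:
# the smaller value indicates the better-fitting model
c(nonlinear = summary(fit_nls)$sigma, linear = summary(fit_lin)$sigma)
```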

A really bad guess

What if we don’t estimate our parameters very well? R will either give an error or get there eventually.

Code
fit_asymptotic <- nls(response ~ a + b*(1-exp(-c*predictor)), data = asymptotic, 
  start = list(a = -100, b = 200, c = 15))

summary(fit_asymptotic)

Formula: response ~ a + b * (1 - exp(-c * predictor))

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a -14.51760    6.64163  -2.186   0.0337 *  
b 113.03800    6.44717  17.533  < 2e-16 ***
c   0.62962    0.07142   8.816 1.32e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.21 on 48 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 2.967e-07

Tip

If an error pops up, try different starting values - the rate of change is most likely the problem.

Another example: Fitting a logistic model

Recap on logistic relationship

y = c + \frac{d-c}{1+e^{-b(x-a)}}

Code
ggplot(data = logistic, aes(x = predictor, y = response)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Predictor", y = "Response")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

S or sigmoid-shaped curve

Finding the starting values

y = c + \frac{d-c}{1+e^{-b(x-a)}}

  • c is the lower limit: the minimum value of y.
  • d is the upper limit: the maximum value of y.
  • a is the value of x when y is halfway between the lower and upper limits.
  • b is the rate of change: the rate at which y approaches the upper limit.
Code
ggplot(data = logistic, aes(x = predictor, y = response)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Predictor", y = "Response") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_hline(yintercept = 300, linetype = "dashed") +
  geom_vline(xintercept = 5, linetype = "dashed") +
  # label the lines above
  annotate("text", x = 9, y = 0, label = "c", size = 8, vjust = -1) +
  annotate("text", x = 0, y = 300, label = "d", size = 8, vjust = 1.5) +
  annotate("text", x = 5, y = 100, label = "a", size = 8, hjust = -1) +
  ## plot the rate
  geom_segment(aes(x = 2.5, y = 60, xend = 6, yend = 250), 
               arrow = arrow(length = unit(0.5, "cm")), 
               color = "red") +
  # label the rate
  annotate("text", x = 4, y = 180, label = "b", size = 8, colour = "red", hjust = -1)
Warning in geom_segment(aes(x = 2.5, y = 60, xend = 6, yend = 250), arrow = arrow(length = unit(0.5, : All aesthetics have length 1, but the data has 51 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Finding the starting values

y = c + \frac{d-c}{1+e^{-b(x-a)}}

  • c is the lower limit: the minimum value of y.
  • d is the upper limit: the maximum value of y.
  • a is the value of x when y is halfway between the lower and upper limits.
  • b is the rate of change: the rate at which y approaches the upper limit.
Code
fit_logistic <- nls(
  response ~ c + (d - c) / (1 + exp(-b * (predictor - a))),
  data = logistic,
  start = list(c = 20, d = 300, b = 5, a = 5)
)

Note

While it can be done this way, it is not the easiest way to do it.

How about a self-starting function?

  • The nls() function requires a formula and starting parameters.


Self-starting functions

  • Several self-starting functions are available in R that can be used to estimate the starting values.
  • These functions are named after the model they fit, e.g. SSasymp(), SSlogis(), SSmicmen(), etc.

Important

We still need to have some understanding of the underlying relationship between the variables to pick the right function.

Revisiting the logistic model

y = c + \frac{d-c}{1+e^{-b(x-a)}}

SSlogis() uses a slightly different equation: a logistic growth curve whose lowest possible value is 0 (i.e. c = 0).

y = \frac{Asym}{1 + \exp\left(\frac{xmid - input}{scal}\right)}

  • input: the predictor variable.
  • Asym: the upper limit.
  • xmid: the value of x when y is halfway between the lower and upper limits.
  • scal: the rate of change.
Code
SSlogis(input, Asym, xmid, scal)

Other than input, the other parameters can be left to the function to estimate.

Code
fit_logistic_ss <- nls(response ~ SSlogis(predictor, Asym, xmid, scal), data = logistic)

What does the fit look like?

Code
ggplot(data = logistic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response") +
  geom_line(aes(y = predict(fit_logistic)), color = "red", size = 2) +
  geom_line(aes(y = predict(fit_logistic_ss)), color = "blue", size = 1)

Check the fit

Code
resids <- nlsResiduals(fit_logistic)
plot(resids)

Interpretation

SSlogis() guessed the parameters on the first try.

Code
summary(fit_logistic)

Formula: response ~ c + (d - c)/(1 + exp(-b * (predictor - a)))

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
c  12.41294    4.30916   2.881  0.00596 ** 
d 304.72555    4.32716  70.422  < 2e-16 ***
b   0.83920    0.04914  17.076  < 2e-16 ***
a   5.00741    0.06806  73.570  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.64 on 47 degrees of freedom

Number of iterations to convergence: 6 
Achieved convergence tolerance: 8.471e-07

Back to asymptotic model

Comparing manual fit to self-starting function

The self-starting function for the asymptotic model is SSasymp().

Code
fit_asymptotic_ss <- nls(response ~ SSasymp(predictor, Asym, R0, lrc), data = asymptotic)

Comparing outputs:

Code
ggplot(data = asymptotic, aes(x = predictor, y = response)) +
  geom_point() +
  labs(x = "Predictor", y = "Response") +
  geom_line(aes(y = predict(fit_asymptotic)), color = "red", size = 2) +
  geom_line(aes(y = predict(fit_asymptotic_ss)), color = "blue", size = 1)

In some cases, the fits are identical, but in others, they are not.

How do we know which model is better? (Advanced)

Note: this is non-examinable content but might be useful for your project.

Example: polynomial regression

Code
library(tidyr)

# Create a new data frame with predictor values and model predictions
predictions <- data.frame(
  predictor = asymptotic$predictor,
  Linear = predict(lin_fit),
  Poly_2 = predict(poly2_fit),
  Poly_3 = predict(poly3_fit),
  Poly_10 = predict(poly10_fit)
)

# Reshape the data to long format
predictions_long <- predictions %>%
  pivot_longer(cols = -predictor, names_to = "Model", values_to = "response")

# Plot the data
ggplot(predictions_long, aes(x = predictor, y = response, color = Model)) +
  geom_point(data = asymptotic, aes(x = predictor, y = response), inherit.aes = FALSE) +
  geom_line(linewidth = 1) +
  labs(x = "Predictor", y = "Response") +
  scale_color_brewer(palette = "Spectral") 

Prediction quality

We can use prediction quality metrics to compare the fits.

AIC and BIC

Use the broom package to extract the AIC and BIC values from the model fits.

Code
library(broom)
# collect all polynomial fits into a single tibble using glance
poly_fits <- tibble(
  model = c("linear", "poly2", "poly3", "poly10"),
  fit = list(lin_fit, poly2_fit, poly3_fit, poly10_fit)) %>%
  mutate(glance = map(fit, glance)) %>%
  unnest(glance) %>%
  select(model, AIC, BIC)
poly_fits
# A tibble: 4 × 3
  model    AIC   BIC
  <chr>  <dbl> <dbl>
1 linear  453.  459.
2 poly2   409.  416.
3 poly3   392.  402.
4 poly10  402.  425.
  • The smaller the AIC or BIC, the better the fit compared to other models.
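As a reminder of where these numbers come from, R computes AIC = 2k - 2\log L and BIC = k\log(n) - 2\log L, where k counts all estimated parameters (including the residual variance). A quick check on a simulated lm fit:

```r
set.seed(1)
# Simulated straight-line data, just to have a model to inspect
x <- 1:30
y <- 2 * x + rnorm(30)
fit <- lm(y ~ x)

k <- attr(logLik(fit), "df")  # parameters: intercept, slope, residual variance
stopifnot(all.equal(AIC(fit), 2 * k - 2 * as.numeric(logLik(fit))))
stopifnot(all.equal(BIC(fit), log(30) * k - 2 * as.numeric(logLik(fit))))
```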

Calculate RMSE and MAE

Code
predictions <- data.frame(
  observed = asymptotic$response,
  Linear = predict(lin_fit),
  Poly_2 = predict(poly2_fit),
  Poly_3 = predict(poly3_fit),
  Poly_10 = predict(poly10_fit)
)

errors <- predictions %>%
  pivot_longer(cols = -observed, names_to = "Model", values_to = "Predicted") %>%
  group_by(Model) %>%
  summarise(
    RMSE = sqrt(mean((observed - Predicted)^2)),
    MAE = mean(abs(observed - Predicted))
)

knitr::kable(errors, digits=2, caption = "Comparison of RMSE and MAE for different models")
Comparison of RMSE and MAE for different models
Model RMSE MAE
Linear 19.38 15.17
Poly_10 9.82 8.57
Poly_2 12.30 9.88
Poly_3 10.25 8.83
  • From the results, the degree-10 polynomial has the lowest error, but visually we know it is overfitting; the cubic polynomial is more parsimonious.
  • We can say the cubic model has a prediction error of 10.25 units (RMSE) and 8.83 units (MAE).

Note

Both the RMSE and MAE measure error on the same scale as the response variable. e.g. if the response variable is in kg, the error will be in kg.

Summary

  • When fitting a nonlinear model, there are three possible approaches:
    1. Linearise the relationship by transforming the response or predictor:
      • Fit: easy
      • Interpret: difficult
    2. Approximate the model by adding polynomial terms:
      • Fit: easy
      • Interpret: difficult
    3. Fit the model using a nonlinear least squares algorithm:
      • Fit: difficult
      • Interpret: easy

  • Nonlinear models:
    • Useful for modelling more complex relationships.
    • Require some understanding of the underlying relationship (especially asymptotic and logistic models).
    • Mainly for prediction rather than interpreting relationships.

Thanks!

This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.